Food detection

Objective:

This notebook aims to detect the presence of food in images. In addition, Grad-CAM is used to visualize which features of the image matter most to the model when classifying.

Methodology:

5 approaches were trained and evaluated:

Results

Training accuracy | Test accuracy:

  1. DNN: 92% | 69%
  2. VGG16 + fully connected: 98% | 90%
  3. VGG16 + SVM: 98% | 89%
  4. New CNN: 100% | 84%
  5. Bootstrap DTC: n/a | 74%

Introduction

Image classification has long been studied and is an important field of machine learning and artificial intelligence. Normally associated with complex models, detecting and correctly classifying an image is not as trivial for a computer as it is for us humans. For that reason, several models exist nowadays, and deep neural networks gained ground with the advances in microprocessors, which relaxed time and memory constraints.

Despite the success of convolutional neural networks at deciphering and understanding the subtle meanings of an image, simple and lightweight classification models (e.g. SVM, linear regression) have always attracted attention for their interpretability and are still explored to model high-dimensional problems (such as image classification).

But the scope of the problem goes beyond how complex a model is. For instance, Joutou et al. [2] proposed an SVM classifier for this purpose and, despite the "simplicity" of the model, the data acquired to predict spherical fruits involve laser scanning to obtain reflectance and range measurements, yielding the fruit's shape and color.

In particular, recognizing the presence of food items in images is also a challenge, and Jimenez et al. proposed one of the first methods back in 1999 [1].

Although food items are difficult to classify, as they are strongly related to color and shape [3], novel methods have been tested, and combinations of multiple CNN models can already predict Mediterranean Diet food items with an accuracy of 52.71% [4].

References:

[1] Joutou, T., Yanai, K.: A food image recognition system with multiple kernel learning. In IEEE International Conference on Image Processing, pp. 285–288 (2009)

[2] Farinella, G. M., Allegra, D., Stanco, F., & Battiato, S. (2015, September). On the exploitation of one class classification to distinguish food vs non-food images. In International Conference on Image Analysis and Processing (pp. 375 -383). Springer, Cham

[3] Farinella, G. M., Allegra, D., & Stanco, F. (2014, September). A benchmark dataset to study the representation of food images. In European Conference on Computer Vision (pp. 584 - 599). Springer, Cham

[4] Papathanail, I., Lu, Y., Vasiloglou, M., Stathopoulou, T., Ghosh, A., Faeh, D., & Mougiakakou, S. (2021, March). Food recognition in assessing the Mediterranean diet: a hierarchical approach (unpublished). In 14th International Conference on Advanced Technologies & Treatments for Diabetes

Data processing

Although the images used for training and testing our models all come from the same dataset (TRAIN and TEST folders), we prepared 3 types of processed data in order to train the different models.

Regardless of the model, due to computational limits, the images had to be resized to 1/4 of their original size (from 240x320 to 60x80).

Furthermore, for training the new CNN, the images also had to be converted to grayscale. Although this is still a topic under discussion in the scientific community, the accuracies of the two approaches (RGB and grayscale) do not differ much [5][6] (see the image below).

data_inet: for the VGG16 model. Input: (samples, 60, 80, 3)

data: for our own architecture. Input: (samples, 60, 80, 1)

data_stack: for the simple DNN. Input: (samples, 4800)
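The three formats above can be sketched as follows, assuming the raw images are loaded as a NumPy array (the actual loading is done in getalldata.py; block-averaging here stands in for whatever resize function the script really uses):

```python
import numpy as np

# Hypothetical batch of raw RGB images at the original 240x320 resolution.
images = np.random.rand(10, 240, 320, 3).astype("float32")

def downsample(imgs, factor=4):
    """Resize to 1/factor of the original size by block-averaging."""
    n, h, w, c = imgs.shape
    return imgs.reshape(n, h // factor, factor, w // factor, factor, c).mean(axis=(2, 4))

data_inet = downsample(images)                       # (samples, 60, 80, 3) for VGG16
gray = data_inet @ np.array([0.299, 0.587, 0.114])   # luminance grayscale conversion
data = gray[..., np.newaxis]                         # (samples, 60, 80, 1) for the new CNN
data_stack = data.reshape(len(data), -1)             # (samples, 4800) for the simple DNN

print(data_inet.shape, data.shape, data_stack.shape)
```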

For more information about the differences between training CNNs with grayscale and RGB images:

[5] Convolutional neural network for human micro-Doppler classification

[6] Color-to-Grayscale: Does the Method Matter in Image Recognition?

Loss-accuracy-curves-of-RGB-grayscale-images.png

To save time, the data were processed once beforehand (refer to the getalldata.py script) and saved for later use.

Code for Grad-Cam

The code for the Grad-CAM used here can be found in the script gradcam.py.
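The core weighting step can be sketched in NumPy, assuming the conv-layer activations and their gradients with respect to the class score have already been extracted (the full Keras pipeline lives in gradcam.py; shapes below are illustrative):

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Grad-CAM: each channel's weight is its spatially averaged gradient;
    the heatmap is the ReLU of the weighted activation sum."""
    weights = gradients.mean(axis=(0, 1))                   # one weight per channel
    cam = np.tensordot(activations, weights, axes=([2], [0]))
    cam = np.maximum(cam, 0)                                # keep positive influence only
    if cam.max() > 0:
        cam /= cam.max()                                    # normalize to [0, 1]
    return cam

# Hypothetical conv-layer output: 15x20 spatial map with 64 channels.
acts = np.random.rand(15, 20, 64)
grads = np.random.randn(15, 20, 64)
heatmap = grad_cam_heatmap(acts, grads)
print(heatmap.shape)
```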

References:

  1. https://medium.com/@daniel.reiff2/understand-your-algorithm-with-grad-cam-d3b62fce353

  2. https://medium.com/@stepanulyanin/implementing-grad-cam-in-pytorch-ea0937c31e82

  3. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra https://arxiv.org/abs/1610.02391

Function to make single predictions while visualizing the image.

Two cases were taken into account, depending on which model is predicting:

While the new CNN model can predict directly from the image input, the pre-trained models first have to pass the image through the convolutional base before predicting.
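A minimal sketch of such a helper, assuming Keras-style objects that expose a `predict` method; the function name and signature are illustrative, not the notebook's exact code:

```python
import numpy as np

def predict_single(image, model, base_model=None):
    """Predict one image; pre-trained heads first need the convolutional base."""
    batch = np.asarray(image)[None, ...]          # add the batch dimension
    if base_model is not None:                    # transfer-learning path
        features = base_model.predict(batch)      # pass through the conv base
        batch = features.reshape(1, -1)           # flatten the features for the head
    return model.predict(batch)
```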

1st approach: Simple DNN

If CNNs are nowadays widely used for image classification tasks, DNNs were the basis of the first models that learned to recognize images. For the sake of curiosity, a simple (really simple) DNN is built here.

It is composed of no more than dense layers!

Parameters

batch_size = 32

rlrop patience = 50

optimizer = Adam()

epochs = 500

Tuning the parameters

In the first attempts, the training accuracy increased up to 0.8, then suddenly fell to a steady value of 0.5 until the end of training.

Taking into account the problems of vanishing gradients and local optima, a possible solution for this behavior is a variable learning rate that reacts when the accuracy stays unchanged over a certain period of training.

For that reason, the ReduceLROnPlateau callback was used to automatically change the learning rate when training reaches a plateau.

This solved the problem: the learning rate is multiplied by a factor of 0.001 after every 50 epochs with no accuracy improvement.
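The plateau mechanism can be illustrated with a minimal re-implementation (a sketch only; the notebook uses the actual Keras ReduceLROnPlateau callback, and the factor/patience values follow the text above):

```python
class ReduceLROnPlateauSketch:
    """Minimal mimic of keras.callbacks.ReduceLROnPlateau (monitoring accuracy)."""

    def __init__(self, lr=1e-3, factor=0.001, patience=50, min_lr=1e-8):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = -float("inf")   # best metric value seen so far
        self.wait = 0               # epochs since the last improvement

    def on_epoch_end(self, metric):
        if metric > self.best:      # improvement: reset the counter
            self.best, self.wait = metric, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:          # plateau reached: shrink the lr
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

# Accuracy stuck at 0.5: the learning rate drops after 50 stagnant epochs.
sched = ReduceLROnPlateauSketch()
for epoch in range(60):
    sched.on_epoch_end(0.5)
print(sched.lr)
```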

2nd approach: pre-trained VGG16 + training fully-connected layer

Transfer Learning

"Standing on the shoulders of giants"

Although any of the pre-trained image classification models should achieve good accuracy on most classification tasks, VGG16 was prioritized because its training dataset (ImageNet) contains food images [7].

References:

[7] https://arxiv.org/pdf/2004.03357.pdf

pretrained models with keras

transfer with CNN

TensorFlow guide

For this model, we'll use the data in the 3-channel format (60, 80, 3), as VGG16 was trained with 3 channels.

Extract the features for both TRAINING and VALIDATION

Train the fully connected layers
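A hedged sketch of this step, assuming binary food/non-food labels. `weights=None` is used here only to avoid downloading the ImageNet weights that the notebook actually loads, and the layer sizes of the head are illustrative:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Hypothetical stand-ins for the notebook's (samples, 60, 80, 3) data_inet arrays.
x_train = np.random.rand(8, 60, 80, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(8,)).astype("float32")

# Convolutional base only (include_top=False); the notebook loads
# weights="imagenet" and keeps the base frozen during head training.
base = VGG16(weights=None, include_top=False, input_shape=(60, 80, 3))
features = base.predict(x_train, verbose=0)          # (8, 1, 2, 512) feature maps

head = tf.keras.Sequential([
    tf.keras.Input(shape=features.shape[1:]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # food vs non-food
])
head.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
head.fit(features, y_train, epochs=1, verbose=0)
```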

Visualize predictions

3rd approach: VGG16 + SVM on top
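A sketch of the SVM head, using randomly generated stand-ins for the extracted VGG16 features (in the notebook the SVM is fitted on the flattened conv-base outputs; the kernel choice here is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical VGG16 features, flattened to one vector per image (1*2*512).
rng = np.random.default_rng(0)
feat_train = rng.normal(size=(40, 1024))
y_train = rng.integers(0, 2, size=40)

# Linear-kernel SVM on top of the frozen convolutional features.
svm = SVC(kernel="linear")
svm.fit(feat_train, y_train)
print(svm.score(feat_train, y_train))
```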

Visualize predictions

Visualize Grad-CAM for the pre-trained VGG16 model

grad-cam for all conv layers in the model

Our own CNN architecture in Keras

Visualize GRAD-CAM

Visualize grad-cam for multiple conv layers

Interestingly, we can see that the first conv layer, conv2d_17, has a more defined gradient output (a more homogeneous gradient of color), whereas the subsequent layers seem to learn other patterns beyond the food shape (some of the activation importance comes from the plate as well!).

All confusion matrices

Note: please refer to the code WITHOUT comments for visualizing the confusion matrix together!

Comparison for the 2 CNN models

New CNN attempts

Due to time and computational constraints, I'll build several new CNN models with only 20 epochs each, changing some parameters like:

the kernel_size:

Borrowing the concept of human learning, from baby to adult, when deciphering an image, and adapting it to this task, I will try different kernel-size patterns through the layers of the model, establishing 3 stages of "learning":

  1. baby: where we first have a broad picture of an image without any categorization of shapes and colors (bigger kernel sizes, e.g. 5 or 7)
  2. child: where we can define and characterize lines, contours and colors (small kernel sizes, e.g. 3 or 1)
  3. adult: where we can generalize without looking at specific parts but rather at the whole context, and are able to infer concepts, ideas and categorizations (bigger kernel sizes, e.g. 5 or 7)
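Under this idea, a schedule such as 7 -> 3 -> 5 can be sketched as follows (filter counts and the final dense layers are illustrative, not the notebook's final choice):

```python
import tensorflow as tf

def build_cnn(schedule=(7, 3, 5)):
    """Stack conv blocks whose kernel sizes follow the baby/child/adult schedule."""
    layers = [tf.keras.Input(shape=(60, 80, 1))]
    for k in schedule:
        layers.append(tf.keras.layers.Conv2D(32, k, padding="same", activation="relu"))
        layers.append(tf.keras.layers.MaxPooling2D())
    layers += [
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # food vs non-food
    ]
    return tf.keras.Sequential(layers)

model = build_cnn()
```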

Below you can find the trained models and their parameters.

modelstable.png

Let's now check some graphs of the val_loss and val_accuracy for this round of trained models.

If you cannot visualize the interactive Plotly graph, check the static one below.

newplot (12).png

For models 5 and 11:

newplot (13).png

From this first round of attempts, a final CNN version (called own2) will be trained with more layers added and 50 epochs of training with data augmentation.

Although the choice of model doesn't follow any strict criterion, I'll give priority to data augmentation and to not overfitting.

Furthermore, as the models were simple and trained with only 20 epochs, I don't think there would be a great performance difference between them when trained with more epochs in a more complex architecture.

Nevertheless, respecting the criteria above, models 5 and 11 are chosen and a blend of their parameters will be made.

I have now added 8 more layers (4 Conv2D and 4 BatchNormalization) and boosted the final dense layer's width from 32 to 128 nodes.

Final accuracies:

tr acc : 0.89

val acc : 0.75

If you cannot visualize the interactive Plotly graph, check the static one below.

newplot (14).png

Comparing all CNN models with one image

BONUS: Bootstrap sampling with Decision Tree Classifier

What if we could drastically reduce the input variables and still have a reasonable model to predict?

While DNNs are commonly used when the task is to detect and classify images, other approaches exist and are sometimes worth checking!

For the sake of curiosity, let's try bootstrap sampling with 5 samples, only 150 input variables (obtained via PCA), and half the training data!

As we are going to train a Decision Tree Classifier, we'll use the data_stack data, where each image is unrolled into one dimension.

Reducing the dimension using PCA.

Let's check how much information we conserve using 150 variables (a drastic reduction: about 97% of the original 4800 variables are left out).

With 150 components, our data conserves around 81% of the information!

Now we transform our original data to keep only these 150 components:
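The whole pipeline can be sketched with scikit-learn, using random stand-in data (on the real images the 150 components conserve the ~81% mentioned above; on this random data the printed variance figure is meaningless):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Hypothetical stand-in for data_stack: flattened 60x80 grayscale images.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4800))
y = rng.integers(0, 2, size=200)

# Keep 150 principal components and check the conserved variance.
pca = PCA(n_components=150).fit(X)
print(f"variance kept: {pca.explained_variance_ratio_.sum():.2f}")
X_pca = pca.transform(X)

# 5 bootstrap samples, each training a Decision Tree on half of the data.
bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=5,
    max_samples=0.5,
    bootstrap=True,
    random_state=0,
).fit(X_pca, y)
print(bag.score(X_pca, y))
```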

XAI: SHAP values

Another interesting approach to explain a ML model comes with SHAP values.

image.png

Nevertheless, due to a TensorFlow version incompatibility, SHAP values couldn't be used here, as downgrading TensorFlow would require remodeling all the code for training the models.

References:

https://shap-lrjball.readthedocs.io/en/latest/example_notebooks/deep_explainer/Front%20Page%20DeepExplainer%20MNIST%20Example.html

threads reporting the issues:

https://github.com/tensorflow/probability/issues/540

https://github.com/slundberg/shap/issues/2189

https://github.com/slundberg/shap/pull/2355